[MRG] Optimize Native unsupervised FastText #1742
Conversation
gensim/models/fasttext.py
Outdated
@@ -9,77 +9,83 @@
from gensim.models.word2vec import Word2Vec, train_sg_pair, train_cbow_pair
from gensim.models.wrappers.fasttext import FastTextKeyedVectors
from gensim.models.wrappers.fasttext import FastText as Ft_Wrapper, compute_ngrams, ft_hash
from gensim import matutils
unused import
gensim/models/fasttext.py
Outdated
from gensim.models.fasttext_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

except ImportError:
    # failed... fall back to plain numpy (20-80x slower training than the above)
worth logging a warning?
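One way to emit that warning on fallback (an illustrative sketch, not the PR's actual code; the `CYTHON_AVAILABLE` flag is added here only so the outcome is observable):

```python
import logging

logger = logging.getLogger(__name__)

try:
    # optimized Cython routines, compiled from fasttext_inner.pyx
    from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow  # noqa: F401
    CYTHON_AVAILABLE = True
except ImportError:
    # failed... fall back to plain numpy (20-80x slower training than the above)
    CYTHON_AVAILABLE = False
    logger.warning(
        "Cython extension 'fasttext_inner' could not be imported; "
        "falling back to the plain-numpy implementation (much slower training)."
    )
```

Either branch leaves the module importable, so callers can also inspect the flag to see which implementation is active.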
gensim/models/fasttext.py
Outdated
@@ -89,12 +95,10 @@ def __init__(self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5,
if self.word_ngrams <= 1 and self.max_n == 0:
    self.bucket = 0

super(FastText, self).__init__(
    sentences=sentences, size=size, alpha=alpha, window=window, min_count=min_count,
super(FastText, self).__init__(sentences=sentences, size=size, alpha=alpha, window=window, min_count=min_count,
PEP8: Arguments on first line forbidden when not using vertical alignment.
https://www.python.org/dev/peps/pep-0008/#indentation
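For reference, both layouts below satisfy PEP 8: either no arguments on the first line with a hanging indent, or vertical alignment under the opening parenthesis. The function and values here are stand-ins, not gensim's signature:

```python
def describe(name, size, alpha, window, min_count):
    """Stand-in function used only to illustrate PEP 8 call wrapping."""
    return (name, size, alpha, window, min_count)

# Hanging indent: no arguments on the first line, continuation lines indented.
result_a = describe(
    "fasttext", 100, 0.025, 5, 5,
)

# Vertical alignment: arguments on the first line, continuations aligned under them.
result_b = describe("fasttext", 100,
                    0.025, 5, 5)

assert result_a == result_b
```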
gensim/models/fasttext.py
Outdated
sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
def __init__(
        self, sentences=None, sg=0, hs=0, size=100, alpha=0.025, window=5, min_count=5,
        max_vocab_size=None, word_ngrams=1, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
The `loss` parameter is unused.
gensim/models/fasttext.py
Outdated
"You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
"First build the vocabulary of your model with a corpus before doing an online update."
)
raise RuntimeError("You cannot do an online vocabulary-update of a model which has no prior vocabulary. "
incorrect indentation
gensim/models/fasttext.py
Outdated
@@ -245,4 +234,4 @@ def load_fasttext_format(cls, *args, **kwargs):

def save(self, *args, **kwargs):
    kwargs['ignore'] = kwargs.get('ignore', ['syn0norm', 'syn0_vocab_norm', 'syn0_ngrams_norm'])
    super(FastText, self).save(*args, **kwargs)
super(FastText, self).save(*args, **kwargs)
PEP8: No newline at the end of file.
gensim/models/fasttext.py
Outdated
-1.0 / self.vector_size, 1.0 / self.vector_size,
(len(self.wv.vocab) - self.old_vocab_len, self.vector_size)
)
new_vocab_rows = rand_obj.uniform(-1.0 / self.vector_size, 1.0 / self.vector_size, (len(self.wv.vocab) - self.old_vocab_len, self.vector_size))
What's the point of reformatting this and the other lines? It makes the lines too long and the code is harder to read.
return 2

FAST_VERSION = init()  # initialize the module
MAX_WORDS_IN_BATCH = MAX_SENTENCE_LEN
No newline at the end of file.
Some weird bug is causing the newline not to be reflected at the end of the `.pyx` files, even though my local commit shows it. The same is the case with `word2vec_inner.pyx`.
# coding: utf-8

import cython
import numpy as np
I'm new to Cython, but isn't this redundant given the `cimport` below?
The `import` gives access to numpy's Python functions, while the `cimport` allows us to use its C modules.
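For context, a minimal Cython sketch (not from this PR) of the standard idiom: `import numpy` binds the Python-level module, while `cimport numpy` exposes the C-level declarations, so both are commonly needed in the same `.pyx` file:

```cython
import numpy as np    # Python-level API: np.zeros, np.float32, dtype objects, ...
cimport numpy as np   # C-level API: typed buffers such as np.ndarray[np.float32_t], ...

def zeros_sum(int n):
    # np.zeros comes from the Python-level import; the typed buffer
    # declaration relies on the C-level cimport.
    cdef np.ndarray[np.float32_t, ndim=1] vec = np.zeros(n, dtype=np.float32)
    cdef int i
    cdef float total = 0.0
    for i in range(n):
        total += vec[i]
    return total
```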
gensim/models/fasttext_inner.pyx
Outdated

DEF MAX_SENTENCE_LEN = 10000
DEF MAX_SUBWORDS = 1000
from word2vec import FAST_VERSION
What exactly is the point of this import?
gensim/models/fasttext_inner.pyx
Outdated
for m in range(j, k):
    if m == i:
        continue
    else:
`else` redundant. Better to leave it out; you save one level of indentation.
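The suggested simplification, shown on a stand-in loop: since `continue` already skips to the next iteration, the body that followed the `else` can simply be dedented.

```python
def other_indices(i, j, k):
    """Collect indices in range(j, k), skipping position i."""
    out = []
    for m in range(j, k):
        if m == i:
            continue
        # no `else` needed: `continue` already skipped to the next iteration
        out.append(m)
    return out

assert other_indices(2, 0, 5) == [0, 1, 3, 4]
```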
gensim/models/fasttext_inner.pyx
Outdated
our_saxpy(&size, &ONEF, &syn0_ngrams[subwords_idx[m][d] * size], &ONE, neu1, &ONE)

if count > (<REAL_t>0.5):
    inv_count = ONEF/count
Please put spaces around operators (e.g. `ONEF / count` rather than `ONEF/count`). Here and elsewhere.
gensim/models/fasttext_inner.pyx
Outdated
for m in range(j,k):
    if m == i:
        continue
    else:
`else` redundant.
gensim/models/fasttext_inner.pyx
Outdated
for m in range(j, k):
    if m == i:
        continue
    else:
`else` redundant.
gensim/models/fasttext_inner.pyx
Outdated
if hs:
    fast_sentence_sg_hs(points[j], codes[j], codelens[j], syn0_vocab, syn0_ngrams, syn1, size, subwords_idx[i], subwords_idx_len[i], _alpha, work, l1, word_locks_vocab, word_locks_ngrams)
if negative:
    next_random = fast_sentence_sg_neg(negative, cum_table, cum_table_len, syn0_vocab, syn0_ngrams, syn1neg, size, indexes[j], subwords_idx[i], subwords_idx_len[i], _alpha, work, l1, next_random, word_locks_vocab, word_locks_ngrams)
I don't have a hard limit for line length, but this is way too long. IMO, it's a good idea to keep line length within 100 chars. Going a little over may be OK where it's difficult to break, but this is clearly not the case here.
gensim/models/fasttext_inner.pyx
Outdated
if count > (<REAL_t>0.5):
    inv_count = ONEF/count
    if cbow_mean:
        sscal(&size, &inv_count, neu1, &ONE)  # (does this need BLAS-variants like saxpy?)
I can see these comments and the TODOs are copied over from `word2vec_inner`. Is now a good time to address them?
gensim/models/fasttext.py
Outdated
from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow
from gensim.models.fasttext_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

except ImportError:
The problem with this solution is that it makes it impossible to test the Python native implementation. There are a number of ways you could fix this. The first thing that comes to my mind is this:

- Add another module `fasttext_native` or `fasttext_legacy` and define the native versions of `train_batch_sg`, `train_batch_cbow`, `FAST_VERSION`, `MAX_WORDS_IN_BATCH` there.
- Import as follows:

```python
try:
    from gensim.models.fasttext_inner import train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH
except ImportError:
    # log warning
    from gensim.models.fasttext_native import train_batch_sg, train_batch_cbow, FAST_VERSION, MAX_WORDS_IN_BATCH
```

- Add two new params to `FastText.__init__()` for the `train_batch_sg` and `train_batch_cbow` functions. The default values would be the imported functions.

To test the native/legacy implementation, you would then be able to initialize `FastText` as follows:

```python
from gensim.models.fasttext import FastText
from gensim.models.fasttext_native import MAX_WORDS_IN_BATCH, train_batch_sg, train_batch_cbow

ft = FastText(..., batch_words=MAX_WORDS_IN_BATCH, train_batch_sg=train_batch_sg, train_batch_cbow=train_batch_cbow)
# test ft
```

I don't insist on this particular solution; there may be more elegant ways. My main concern is that the native implementation should also be testable and covered by unit tests (just parametrize existing unit tests to test both the native and Cython implementations).
Native Python implementations will be removed in the next refactor. We're not going to support them any more, now that @menshikh-iv can build good wheels.
I hope that this will be true @piskvorky. We need to distribute our wheels for several releases in a row first, and if all is OK we can remove the Python parts (but this process is not fast; I suggest not rushing it).
@janpom this code looks like it repeats logic from https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L137.
What I mean is there's no point introducing new constructs for testing both versions, when we know we only want one version.
gensim/models/fasttext.py
Outdated
This module allows training a word embedding from a training corpus with the additional ability
to obtain word vectors for out-of-vocabulary words.

For a tutorial on gensim's native fasttext, refer the noteboook --
refer TO the notebook
gensim/models/fasttext.py
Outdated
@@ -1,5 +1,27 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2013 Radim Rehurek <me@radimrehurek.com>
That doesn't seem right -- and FastText didn't even exist yet in 2013 :)
@manneshiva You can use this block:

```python
# Authors: Shiva Manne <s.manne@rare-technologies.com>
# Copyright (C) 2017 RaRe Technologies s.r.o.
```
Looks good to me. |
Please be accurate with docstrings; we want clear and consistent (numpy-style) docstrings across the whole codebase.
gensim/models/fasttext.py
Outdated

def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
    """
    Update CBOW model by training on a sequence of sentences.
Please use numpy-style docstrings (here and everywhere). Useful links: http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html, https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
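A minimal numpy-style docstring sketch for the function under review. The signature matches the diff above, but the parameter descriptions, types, and the placeholder body are illustrative assumptions, not gensim's actual documentation or implementation:

```python
def train_batch_cbow(model, sentences, alpha, work=None, neu1=None):
    """Update a CBOW model by training on a sequence of sentences.

    Parameters
    ----------
    model : object
        The model instance being trained (placeholder type for illustration).
    sentences : iterable of list of str
        The corpus, as an iterable of tokenized sentences.
    alpha : float
        The current learning rate.
    work : object, optional
        Private working memory for the training routine.
    neu1 : object, optional
        Private working memory for the hidden layer.

    Returns
    -------
    int
        The number of words processed (placeholder semantics).
    """
    # Placeholder body so the sketch is runnable; the real function trains the model.
    return sum(len(sentence) for sentence in sentences)
```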
Sure, will make the changes.
gensim/models/fasttext.py
Outdated
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6, sorted_vocab=1, bucket=2000000,
trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
"""
Initialize the model from an iterable of `sentences`. Each sentence is a
please add examples of usage (small executable pieces of code)
Great work @manneshiva 💣 !!
Please fix docstring format and I'll merge this.
gensim/models/fasttext.py
Outdated
max_vocab_size=None, word_ngrams=1, loss='ns', sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, min_n=3, max_n=6,
sorted_vocab=1, bucket=2000000, trim_rule=None, batch_words=MAX_WORDS_IN_BATCH):
"""Class for training, using and evaluating word representations learned using method described in
Please use numpy-style format for docstrings (everywhere).
…instead of `from gensim.models.fasttext ...`)
Optimizes and speeds up the native `FastText` pure-Python implementation by Cythonizing it.